2D-to-3D reconstruction is an ill-posed problem, yet humans are good at solving this problem due to their prior knowledge of the 3D world developed over years. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representations by minimizing a diffusion loss on its arbitrary view renderings with a pretrained image diffusion model under the input-view constraint. We leverage off-the-shelf vision-language models and introduce a two-section language guidance as conditioning inputs to the diffusion model. This is essentially helpful for improving multiview content coherence as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method can synthesize novel views with higher quality even compared to existing methods trained on this dataset. We also demonstrate our generalizability in zero-shot NeRF synthesis for in-the-wild images.
translated by 谷歌翻译
现有的步态识别研究以实验室场景为主。由于人们生活在现实世界中,因此野外的步态识别是一个更实用的问题,最近引起了多媒体和计算机视觉社区的关注。在现有基准上获得最先进性能的当前方法在最近提出的野外数据集上的准确性差得多,因为这些方法几乎无法模拟不受约束场景中步态序列的各种时间动力学。因此,本文提出了一种新型的多跳时间开关方法,以实现实际场景中步态模式的有效时间建模。具体来说,我们设计了一个新型的步态识别网络,称为多跳临时交换机网络(MTSGait),以同时学习空间特征和多尺度的时间功能。与现有的3D卷积进行时间建模的方法不同,我们的MTSGAIT通过2D卷积对步态序列的时间动力学进行建模。通过这种方式,与基于3D卷积的模型相比,它以较少的模型参数来达到高效率,并减少了优化的难度。基于2D卷积内核的特定设计,我们的方法可以消除相邻帧之间特征的不对准。此外,提出了一种新的采样策略,即非环保连续采样,以使模型学习更强大的时间特征。最后,与最新方法相比,提出的方法在两个公共步态数据集(即增长和步态3D)上取得了出色的性能。
translated by 谷歌翻译
Panoptic图像分割是计算机视觉任务,即在图像中查找像素组并为其分配语义类别和对象实例标识符。由于其在机器人技术和自动驾驶中的关键应用,图像细分的研究变得越来越流行。因此,研究社区依靠公开可用的基准数据集来推进计算机视觉中的最新技术。但是,由于将图像标记为高昂的成本,因此缺乏适合全景分割的公开地面真相标签。高标签成本还使得将现有数据集扩展到视频域和多相机设置是一项挑战。因此,我们介绍了Waymo Open DataSet:全景视频全景分割数据集,这是一个大型数据集,它提供了用于自主驾驶的高质量的全景分割标签。我们使用公开的Waymo打开数据集生成数据集,利用各种相机图像集。随着时间的推移,我们的标签是一致的,用于视频处理,并且在车辆上安装的多个摄像头保持一致,以了解全景的理解。具体而言,我们为28个语义类别和2,860个时间序列提供标签,这些标签由在三个不同地理位置驾驶的自动驾驶汽车上安装的五个摄像机捕获,从而导致总共标记为100k标记的相机图像。据我们所知,这使我们的数据集比现有的数据集大量数据集大的数量级。我们进一步提出了一个新的基准,用于全景视频全景分割,并根据DeepLab模型家族建立许多强大的基准。我们将公开制作基准和代码。在https://waymo.com/open上找到数据集。
translated by 谷歌翻译
已经证明了现代自动驾驶感知系统在处理互补输入之类的利用图像时,已被证明可以改善互补投入。在孤立中,已发现2D图像非常容易受到对抗性攻击的影响。然而,有有限的研究与图像特征融合的多模态模型的对抗鲁棒性。此外,现有的作品不考虑跨输入方式一致的物理上可实现的扰动。在本文中,我们通过将对抗物体放在主车辆的顶部上展示多传感器检测的实际敏感性。我们专注于身体上可实现的和输入 - 不可行的攻击,因为它们是在实践中执行的可行性,并且表明单个通用对手可以隐藏来自最先进的多模态探测器的不同主机。我们的实验表明,成功的攻击主要是由易于损坏的图像特征引起的。此外,我们发现,在将图像特征中的现代传感器融合方法中,对抗攻击可以利用投影过程来在3D中跨越区域产生误报。朝着更强大的多模态感知系统,我们表明,具有特征剥夺的对抗训练可以显着提高对这种攻击的鲁棒性。然而,我们发现标准的对抗性防御仍然努力防止由3D LIDAR点和2D像素之间不准确的关联引起的误报。
translated by 谷歌翻译
Automatic synthesis of realistic images from text would be interesting and useful, but current AI systems are still far from this goal. However, in recent years generic and powerful recurrent neural network architectures have been developed to learn discriminative text feature representations. Meanwhile, deep convolutional generative adversarial networks (GANs) have begun to generate highly compelling images of specific categories, such as faces, album covers, and room interiors. In this work, we develop a novel deep architecture and GAN formulation to effectively bridge these advances in text and image modeling, translating visual concepts from characters to pixels. We demonstrate the capability of our model to generate plausible images of birds and flowers from detailed text descriptions.
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
The development of social media user stance detection and bot detection methods rely heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, suppressing graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built based on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain and user tweet features as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing experiment results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
translated by 谷歌翻译
Learning feature interactions is the key to success for the large-scale CTR prediction and recommendation. In practice, handcrafted feature engineering usually requires exhaustive searching. In order to reduce the high cost of human efforts in feature engineering, researchers propose several deep neural networks (DNN)-based approaches to learn the feature interactions in an end-to-end fashion. However, existing methods either do not learn both vector-wise interactions and bit-wise interactions simultaneously, or fail to combine them in a controllable manner. In this paper, we propose a new model, xDeepInt, based on a novel network architecture called polynomial interaction network (PIN) which learns higher-order vector-wise interactions recursively. By integrating subspace-crossing mechanism, we enable xDeepInt to balance the mixture of vector-wise and bit-wise feature interactions at a bounded order. Based on the network architecture, we customize a combined optimization strategy to conduct feature selection and interaction selection. We implement the proposed model and evaluate the model performance on three real-world datasets. Our experiment results demonstrate the efficacy and effectiveness of xDeepInt over state-of-the-art models. We open-source the TensorFlow implementation of xDeepInt: https://github.com/yanyachen/xDeepInt.
translated by 谷歌翻译
In this paper, we study the problem of knowledge-intensive text-to-SQL, in which domain knowledge is necessary to parse expert questions into SQL queries over domain-specific tables. We formalize this scenario by building a new Chinese benchmark KnowSQL consisting of domain-specific questions covering various domains. We then address this problem by presenting formulaic knowledge, rather than by annotating additional data examples. More concretely, we construct a formulaic knowledge bank as a domain knowledge base and propose a framework (ReGrouP) to leverage this formulaic knowledge during parsing. Experiments using ReGrouP demonstrate a significant 28.2% improvement overall on KnowSQL.
translated by 谷歌翻译
Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, resulting in predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving policy related representations and thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
translated by 谷歌翻译